Analysis of Internet Dating and Relationship Longevity

Oladotun Oladimejij

Outline

  1. Introduction
    • Background Information
    • Required tools
  2. Data Collection
    • About the Data
    • Loading & Tidying the Data
  3. Data Analysis & Visualization
    • Looking at the Data as a Whole
    • Let's Look More Closely at People Who Met Online
    • What If We Only Looked At the Past Decade?
  4. Machine Learning
    • Organizing Data
    • Splitting Data
    • Model Building & Training
      • Non-Linear SVM
      • Linear SVM
  5. Final Thoughts

Introduction

Over the past two decades, the rise of the internet has been accompanied by a rise in online dating. Even though online dating is very popular, in conversations with friends or on social media you will probably hear people say that online dating is superficial and does not really result in long-lasting relationships.

Throughout this tutorial I would like to analyze how the relationships of couples who met online measure up to those of couples who did not, by analyzing features like relationship quality, age, the year they met, and other demographics. I also want to see whether we can predict how long a relationship will last based on how the couple met, as well as the other features mentioned above.

Background Information

According to The Virtues and Downsides of Online Dating, 30% of U.S. adults say they have used a dating site or app. Of those who have used online dating to meet people, about six in ten say their experience was positive: they were able to find people they were attracted to and who shared their interests. This suggests that, at least in the short term, online dating might not be as bad as people say.

That being said, I want to examine how the length of a relationship that starts online compares with relationships that started in more traditional ways.

Required Tools

Data Collection

About the Data

This data is a collection of survey responses gathered by Stanford in 2017, building on data they collected in 2009. Respondents describe how they met their partner, their relationship, and whether and why they stayed together, with features like relationship_quality, the year they met, and more. The dataset has 3,510 survey respondents and 285 columns.

Dataset used: https://data.stanford.edu/hcmst2017#download-data

Loading and Tidying Data

The first thing we are going to do is use pandas to read in the .dta file and save it as a dataframe, then take a peek at the head. There are over 200 columns, most of which we will not use. I found it easier not to drop the unneeded columns now and instead to collect the needed columns into one dataframe later, so for now let's leave it as it is.
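The loading step could look roughly like the sketch below. It assumes pandas.read_stata() is used for the .dta file. The helper name load_survey(), the demo column names, and the demo file are hypothetical stand-ins so the example runs without the real download.

```python
import os
import tempfile

import pandas as pd


def load_survey(path):
    """Read a Stata .dta file into a dataframe (hypothetical helper)."""
    # convert_categoricals=True (the default) keeps labels such as "yes"/"no"
    # as text instead of numeric codes.
    return pd.read_stata(path, convert_categoricals=True)


# Tiny stand-in file with made-up columns, NOT the real survey data.
demo = pd.DataFrame({"Q1": ["yes", "no"], "age": [34, 51]})
demo_path = os.path.join(tempfile.mkdtemp(), "demo.dta")
demo.to_stata(demo_path, write_index=False)

df = load_survey(demo_path)
print(df.head())
```

With the real dataset, the path passed to load_survey() would be wherever the downloaded .dta file was saved.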

Renaming Columns

Many columns are named after the number of the survey question they correspond to, so I renamed them to something more intuitive. I grouped the renamed columns into two dicts, met and other, because I wanted to iterate over met.values() when grouping and visualizing the data in later steps.
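As a rough sketch of the renaming step: the two dicts map old names to new names and are merged for DataFrame.rename(). The question codes (QA, QB, ...) and most of the new names are made up for illustration; the real survey codes differ.

```python
import pandas as pd

# Hypothetical question codes; the real survey column names are different.
met = {"QA": "met_online", "QB": "met_through_friends"}
other = {"QC": "relationship_quality", "QD": "year_met"}

df = pd.DataFrame({"QA": ["yes"], "QB": ["no"], "QC": ["good"], "QD": [2012]})

# Merge both dicts and rename in one pass.
df = df.rename(columns={**met, **other})

# The list of "ways met" columns used in later steps.
ways_met = list(met.values())
print(df.columns.tolist())
```

Keeping met separate from other is what makes it easy to get ways_met as a list later.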

Storing Values for Analysis Later

Encoding Values

I need to convert the qualitative responses into quantitative values so I can create an input that can be represented as a vector in the machine learning portion later.

The first thing I will encode is how the couple met. The data already separates how the couples met into separate columns where each entry is either "yes" or "no", so I will code yes = 1 and no = 0.

The function coded_values() takes in a dataframe and a column indicating one way the couple met, and builds an array that appends 1 if the value in that row is 'yes' and 0 if it is 'no'. The function add_coded_values() takes in a dataframe along with the list ways_met and changes all the values in those columns to their coded values. The ways_met list comes from the values of the met dict defined earlier. All missing or null values are set to NaN.
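The two functions described above could be sketched like this. The function names come from the text, but the bodies are my reconstruction from the description, not necessarily the original code, and the demo dataframe is made up.

```python
import numpy as np
import pandas as pd


def coded_values(df, column):
    """Return a list with 1 for 'yes', 0 for 'no', and NaN for anything else."""
    coded = []
    for value in df[column]:
        if value == "yes":
            coded.append(1)
        elif value == "no":
            coded.append(0)
        else:
            coded.append(np.nan)  # missing or unexpected answers
    return coded


def add_coded_values(df, ways_met):
    """Replace every 'ways met' column with its coded version."""
    for column in ways_met:
        df[column] = coded_values(df, column)
    return df


# Made-up rows for illustration only.
demo = pd.DataFrame({"met_online": ["yes", "no", None]})
print(add_coded_values(demo, ["met_online"]))
```

Because NaN is a float, a coded column that contains missing values ends up with a float dtype.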

I will be coding relationship quality in two ways. The first is a scale from 0 to 4, where 0 represents a very poor relationship and 4 represents an excellent relationship. Missing values are stored as NaN.
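A minimal sketch of the 0 to 4 coding, assuming the five response labels are "very poor", "poor", "fair", "good", and "excellent". Only some of these labels appear in the text, so check them against the actual survey values. The name code_quality() is hypothetical.

```python
import pandas as pd

# Assumed response labels; verify against the real data before using.
QUALITY_CODES = {"very poor": 0, "poor": 1, "fair": 2, "good": 3, "excellent": 4}


def code_quality(series):
    """Map quality labels to 0-4; anything unmapped or missing becomes NaN."""
    return series.str.lower().map(QUALITY_CODES)


# Made-up responses for illustration only.
print(code_quality(pd.Series(["Excellent", "fair", None])))
```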

I am also encoding relationship quality as a vector, to be used as an input for machine learning prediction in later steps. Each entry is set to 1 if the relationship quality is of that type, for example "fair", and 0 if not.
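The vector encoding can be sketched as one 0/1 column per quality level. The level labels, the column name relationship_quality, and the quality_ prefix are assumptions for illustration.

```python
import pandas as pd

# Assumed labels, lowest to highest quality.
QUALITY_LEVELS = ["very poor", "poor", "fair", "good", "excellent"]


def quality_vector(df, column="relationship_quality"):
    """Add one 0/1 column per quality level (hypothetical helper)."""
    out = df.copy()
    for level in QUALITY_LEVELS:
        new_column = "quality_" + level.replace(" ", "_")
        # 1 where the row has this level, 0 otherwise (missing rows get all zeros).
        out[new_column] = (out[column] == level).astype(int)
    return out


# Made-up responses for illustration only.
demo = pd.DataFrame({"relationship_quality": ["fair", "excellent", None]})
print(quality_vector(demo))
```

pandas.get_dummies() would give a similar result in one call; the loop just makes the "1 if that type, 0 if not" rule explicit.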

Adjusting Data Types and Adding Columns

Below I will convert some columns to numeric values for data analysis and for the creation of new columns.

Above I created a column that holds the age difference between partners, because I wanted to see how relationship quality and duration vary with age difference.
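A small sketch of the conversion and the age difference column. The column names respondent_age and partner_age are hypothetical, and taking the absolute difference is my assumption about how the age difference was defined.

```python
import pandas as pd

# Made-up rows; ages often arrive as text in survey exports.
df = pd.DataFrame(
    {"respondent_age": ["34", "51", "refused"], "partner_age": ["30", "55", "40"]}
)

# Convert to numbers; anything that cannot be parsed becomes NaN.
for column in ["respondent_age", "partner_age"]:
    df[column] = pd.to_numeric(df[column], errors="coerce")

# Absolute difference, so it does not matter which partner is older.
df["age_difference"] = (df["respondent_age"] - df["partner_age"]).abs()
print(df)
```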

Below I am adding a column that says how the couple met, so I will be able to group the dataframe by way met later.
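One way to build that column from the 0/1 coded columns is sketched below. The helper name label_way_met() is hypothetical, and picking only the first flagged column for couples with more than one way met is a simplifying assumption.

```python
import numpy as np
import pandas as pd


def label_way_met(df, ways_met):
    """Return the name of the first 'ways met' column coded 1, else NaN."""
    flags = df[ways_met].fillna(0)
    labels = flags.idxmax(axis=1)
    # idxmax returns the first column even when every flag is 0,
    # so blank out the rows where nothing was flagged.
    return labels.where(flags.max(axis=1) == 1)


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {"met_online": [1, 0, 0], "met_through_friends": [0, 1, np.nan]}
)
demo["way_met"] = label_way_met(demo, ["met_online", "met_through_friends"])
print(demo)
```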

After initial tidying this is how the dataframe looks.

Data Analysis & Visualization

Graphing Tools

I will be using these functions below to plot the data.

The functions met_freq() and met_freq2() count how many couples met in each specific way. This will be useful for the bar graphs that make_graphs() will generate later. met_freq() counts the frequency when the values are numeric, and met_freq2() counts it when the values are strings.
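The two counting helpers might look something like this. The function names come from the text; the signatures, the bodies, and the way_met column name are my guesses based on the description.

```python
import pandas as pd


def met_freq(df, ways_met):
    """Count couples per way met when the columns hold 1/0 codes."""
    # Summing 1/0 flags gives the count; NaN values are skipped.
    return df[ways_met].sum().sort_values(ascending=False)


def met_freq2(df, column="way_met"):
    """Count couples per way met when one column holds string labels."""
    return df[column].value_counts()


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "met_online": [1, 1, 0],
        "met_in_school": [0, 0, 1],
        "way_met": ["met_online", "met_online", "met_in_school"],
    }
)
print(met_freq(demo, ["met_online", "met_in_school"]))
print(met_freq2(demo))
```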

The function make_graphs() separates the graphs by decade and shows the most popular ways that couples met each other in that decade.

The function regression_plot() takes in a dataframe, the x and y columns, their corresponding labels, and finally whether or not the data points should be annotated. It draws a regression plot with a line of best fit.
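The plotting code itself is not reproduced here, but the line of best fit that a regression plot draws can be computed with a least-squares fit. This sketch uses numpy.polyfit(). The helper name best_fit_line() and the demo columns are hypothetical, and the drawing step is left out.

```python
import numpy as np
import pandas as pd


def best_fit_line(df, x, y):
    """Return (slope, intercept) of the least-squares line of y against x."""
    data = df[[x, y]].dropna()  # rows missing either value cannot be fitted
    slope, intercept = np.polyfit(
        data[x].to_numpy(dtype=float), data[y].to_numpy(dtype=float), deg=1
    )
    return slope, intercept


# Made-up points for illustration only.
demo = pd.DataFrame({"age_difference": [0, 1, 2, 3], "duration": [1, 3, 5, 7]})
slope, intercept = best_fit_line(demo, "age_difference", "duration")
print(slope, intercept)
```

A negative slope corresponds to the "negatively correlated" readings of the graphs, and a slope near zero to an almost horizontal regression line.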

Looking At Data As a Whole

Before we start looking at specific ways couples met, I would like to see if we can notice trends in the data overall. First, I would like to talk about the column 'relationship_duration'. Relationship duration was calculated by subtracting when the couple first started their relationship from when it ended. If a couple was still together in July 2017, that date is treated as the "end" for analysis purposes. So, to be clear, relationship duration does not only apply to couples who have broken up.
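The duration rule can be sketched as follows. The column names year_started and year_ended are hypothetical, and working in whole years (2017 rather than July 2017) is a simplification on my part.

```python
import numpy as np
import pandas as pd

SURVEY_YEAR = 2017  # couples still together are treated as "ending" here


def relationship_duration(df):
    """Years between the start of the relationship and its end (or the survey)."""
    end_year = df["year_ended"].fillna(SURVEY_YEAR)
    return end_year - df["year_started"]


# Made-up rows: one couple that broke up, one still together.
demo = pd.DataFrame({"year_started": [2000, 2010], "year_ended": [2005, np.nan]})
demo["relationship_duration"] = relationship_duration(demo)
print(demo)
```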

This is a clear linear relationship, which makes sense: the less time you have known someone, the shorter your relationship.

The graph above plots Age Difference between Couples vs. Relationship Duration, both in years. There seems to be a strong correlation: the smaller the age difference, the more likely couples are to stay together.

The graph above plots the age at which the respondent first met their partner vs. Relationship Duration, both in years. There seems to be a strong correlation: the younger respondents were when they met, the longer their relationship is likely to last.

The graph above plots the time between first meeting and the start of the relationship vs. Relationship Duration, both in years. Although the regression line is not that steep, it appears that the longer it takes for couples to get into a relationship, the shorter their relationship might be.

I wanted to use a violin plot to see how relationship duration varies with the way couples met. Couples that met online seem to have a shorter relationship duration than couples that met in other ways, but the values seem to be more widely distributed.

How Has the Way People Met Their Partners Changed Over Time?

There is a tool in pandas, pandas.cut(), that lets you cut the data into bins of equal width. I did not use it here because I wanted to split the data by decade, and it would not have represented the data the way I wanted. Doing it by hand is a bit more work, but it shows part of how our dating patterns as a society have changed.
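One simple way to split by decade without pandas.cut() is integer division on the year, sketched below. The helper name add_decade() and the column names are hypothetical, and this may not match how the original split was done.

```python
import pandas as pd


def add_decade(df, year_column="year_met"):
    """Add a 'decade' column, e.g. 1995 -> 1990 (hypothetical helper)."""
    out = df.copy()
    out["decade"] = (out[year_column] // 10) * 10
    return out


# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "year_met": [1995, 1999, 2003, 2016],
        "met_online": [0, 1, 1, 1],
        "met_through_friends": [1, 0, 0, 0],
    }
)
df = add_decade(df)

# Number of couples per way met in each decade.
counts = df.groupby("decade")[["met_online", "met_through_friends"]].sum()
print(counts)
```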

At the top of each graph will be the most popular way couples met, and at the bottom the least popular. It will be interesting to see how the 'met_online' category moves throughout each decade.

Meeting online was virtually nonexistent until the 1960s. I wonder whether the 1960s entries are an error, because the internet did not become public until the 1990s. There could be other explanations, such as respondents having access to the internet before then, but that was not how the internet was used at the time. So I assumed it was either an error made by the respondent or the respondent was not talking about that year, but I checked and the year did actually correspond to years in the 1960s.

You will notice that the 1990s is when online dating really started to take off, which makes sense because that is when the internet became public. In the last decade, meeting online became the most popular way to meet someone.

Analyzing Statistics

Now we will be grouping the data by the way the couples met and taking the average of the groupings.

I am adjusting the column labels to correspond to what the column actually represents, which is the average.

When I take the average of relationship quality, I am using the values I coded from 0 to 4 and taking the average of those values.
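The grouping and relabeling steps above could be sketched like this. The column names way_met, relationship_duration, and quality_code are hypothetical, and the numbers are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "way_met": ["met_online", "met_online", "met_in_school"],
        "relationship_duration": [2.0, 4.0, 10.0],
        "quality_code": [4, 2, 3],  # relationship quality coded 0 to 4
    }
)

averages = (
    df.groupby("way_met")[["relationship_duration", "quality_code"]]
    .mean()
    # Relabel so the columns say what they now hold: averages.
    .rename(
        columns={
            "relationship_duration": "Average Relationship Duration",
            "quality_code": "Average Relationship Quality",
        }
    )
)
print(averages)
```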

You can see that met_online is an outlier with the lowest Average Relationship Duration, and that couples who met online tend on average to be older. Average relationship duration and average age when the couple met seem to be negatively correlated.

The graph above also shows met_online as an outlier, but the average time before defining the relationship and the average relationship duration do not seem to be correlated at all, as the regression line is almost horizontal.

The graph above shows the average age difference between partners against relationship duration, and the two seem to be negatively correlated. As shown above when looking at the data as a whole, the greater the age difference, the lower the relationship duration.

Now I want to look for some patterns. Above is the frequency of people who married the partner they are talking about in the survey. I notice that people who met online are less likely to get married than other couples. I think this is interesting because meeting online is very new, yet they are roughly tied with couples that met in primary or secondary school.

The graph above shows that most couples are happy in their relationship regardless of how they met each other.

Among relationships that did not lead to marriage and have ended, the groups that broke up the most were couples that met online and couples that met in primary or secondary school.

I wanted to analyze whether the country the couple met in affects their relationship quality and the length of their relationship. It is interesting to note that people who met in the United States seemed to have a more negative view of their relationship than people who met in other countries.

Looking Closely at People Who Met Online

I wanted to see how, for couples who met online, features like age difference, age when they met, time before they defined the relationship, and relationship quality relate to relationship duration.

Here I am counting the different ways people met online.

As I expected, internet dating is the most popular way people met on the internet, which makes sense with the rise of dating apps like Tinder, Bumble, and others. After internet dating, internet_other follows. I am not sure what that entails, but it probably covers ways that are not as conventional.

Most couples that met online view their relationship as excellent and in a positive manner.

Looking at the graph above, female respondents tended to see their relationship more negatively than male respondents did.

Among couples that met online, the age group that most often sees the relationship as poor seems to be those aged 25-34 in 2017, which is interesting because they are the age group that grew up on the internet.

Age when they first met and relationship duration do not seem to be strongly correlated.

The regression line suggests a correlation, but the points cluster around the younger values, which seem to have a longer relationship duration.

What If We Focused on the Last Decade?

If we go back to when we looked at the data as a whole, met_online was an outlier in the regression plots of the averages, but I wonder if that was because of all the time the other ways of meeting had before the online era. We found that met_online was the most popular way couples met in the decade starting in 2010. I wonder how dating has been affected in the online era, both for meeting online and for the other ways.

Compared to the violin plot that looks at all the data, the data after 2010 looks more evenly distributed, but met_online still seems to have a shorter relationship duration.

Looking at the graphs, met_online is less of an outlier than before. The graphs seem to be negatively correlated, but the relationship quality graph has an outlier that makes the regression line steeper than it should be in the graph that displays Average Relationship Quality v. Relationship Duration.

Machine Learning

Organizing the Data

The function clean_data() takes in a dataframe and removes the rows that contain NaN or infinite values, so they do not affect the inputs to the machine learning model.
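A minimal sketch of what clean_data() might do, based on the description above. The body is my reconstruction, and the demo values are made up.

```python
import numpy as np
import pandas as pd


def clean_data(df):
    """Drop every row that contains a NaN or an infinite value."""
    # Turn +/- infinity into NaN first so dropna() removes those rows too.
    return df.replace([np.inf, -np.inf], np.nan).dropna()


# Made-up rows: only the first one is fully usable.
demo = pd.DataFrame(
    {"age_difference": [2.0, np.inf, 5.0], "duration": [10.0, 3.0, np.nan]}
)
print(clean_data(demo))
```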

Earlier I mentioned that I would not drop the 200+ unused columns and would instead grab only the columns I need as inputs and put them into a new dataframe. That dataframe, called mini, is a smaller version of the big dataframe.

Splitting Data
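The split itself is not shown here. A common way to do it, assuming scikit-learn's train_test_split(), is sketched below. The stand-in mini dataframe, its column names, the 80/20 ratio, and the random_state are all assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the mini dataframe.
mini = pd.DataFrame(
    {
        "met_online": [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
        "age_difference": [2, 5, 1, 3, 8, 0, 4, 6, 2, 1],
        "relationship_duration": [3, 20, 35, 5, 12, 2, 18, 9, 4, 40],
    }
)

X = mini[["met_online", "age_difference"]]  # model inputs
y = mini["relationship_duration"]  # value to predict

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```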

Non-Linear SVM

A non-linear SVM (Support Vector Machine) extends SVM to solve nonlinear regression tasks when the set of samples cannot be separated linearly.

Linear SVM

A linear SVM (Support Vector Machine) can be used to solve linear regression tasks when the set of samples can be separated linearly.
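To make the comparison concrete, here is a rough sketch of fitting both kinds of SVM regressor with scikit-learn's SVR. It runs on synthetic data, NOT the survey data, so its numbers say nothing about the results discussed in this tutorial. I am also assuming that "accuracy" means the R² value returned by score(), and that RMSE is the square root of the mean squared error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data: three features, roughly linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def evaluate(kernel):
    """Fit an SVR with the given kernel and return (R^2, RMSE) on the test set."""
    model = SVR(kernel=kernel).fit(X_train, y_train)
    predictions = model.predict(X_test)
    r2 = model.score(X_test, y_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
    return r2, rmse


# "rbf" is the non-linear kernel, "linear" the linear one.
results = {kernel: evaluate(kernel) for kernel in ("rbf", "linear")}
for kernel, (r2, rmse) in results.items():
    print(kernel, "R^2:", round(r2, 3), "RMSE:", round(rmse, 3))
```

Comparing the two kernels on the same split is what makes the scores comparable.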


Since the linear SVM had a higher accuracy than the non-linear SVM, it appears to have performed better, and there also seemed to be less noise since its RMSE is smaller. This might suggest the data is linearly separable, but I am a little skeptical as to why the accuracy is so high for the linear SVM. I think it has to do with the year met column being highly correlated with relationship duration, because without it the accuracy is about .32 and the RMSE is much higher for each regression model. This leads me to think the other features were not as highly correlated with the duration of the relationship.

Final Thoughts

Based on my analysis, relationships that started online do not really last long if you look at the dataset as a whole. That being said, there is some relationship between the way couples met and how long their relationship lasted, along with age difference, the year they met, and their relationship quality, but I would not consider it strong, because the accuracy for the non-linear SVM is only about .240 on the test data. Both models had an RMSE of about 14.

As for online dating, it may not last the longest if you look at the data as a whole, but those who met online tended to view their relationships as excellent in quality. If you look at the past decade, couples who met online are similar to other couples in terms of relationship quality, and not too far off in duration either. So online dating has not been drastically better or worse than other ways of meeting in the past decade.

Citations

Anderson, M., Vogels, E. A., & Turner, E. (2020, February 6). The virtues and downsides of online dating.
Pew Research Center: Internet, Science & Tech. Retrieved December 20, 2021, from
https://www.pewresearch.org/internet/2020/02/06/the-virtues-and-downsides-of-online-dating/

How couples meet and stay together 2017 (HCMST2017). How Couples Meet and Stay Together 2017 (HCMST2017) | SSDS
Social Science Data Collection. (n.d.). Retrieved December 20, 2021, from
https://data.stanford.edu/hcmst2017#download-data

Pupale, R. (2019, February 11). Support vector machines(svm) - an overview. Medium. Retrieved December 20,
2021, from https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989